Semi-Semantic Part of Speech Annotation and Evaluation
Abstract
This paper presents a semi-semantic part-of-speech annotation scheme and its evaluation via Krippendorff’s α for the URDU.KON-TB treebank, developed for the South Asian language Urdu. The part-of-speech annotation, extended with morphological and semantic subcategories, provides the treebank with sufficient encoded information. The corpus was collected from the Urdu Wikipedia and newspapers. The sentences were annotated manually to ensure high annotation quality. The inter-annotator agreement obtained is 0.964, which falls in the range interpreted as perfect agreement. Urdu is a comparatively under-resourced language, and the development of a treebank with rich part-of-speech annotation will have a significant impact on the state of the art in Urdu language processing.
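As an illustration of the agreement measure named in the abstract, the following is a minimal sketch of Krippendorff’s α for nominal labels, computed from a coincidence matrix as α = 1 − D_o / D_e. The function name `krippendorff_alpha_nominal`, the input layout, and the example tags are assumptions made for this sketch; they are not the paper’s tagset, data, or evaluation code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of items; each item is the list of labels that the
    annotators assigned to it (None marks a missing annotation).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by different annotators contributes 1 / (m - 1), where m is the
    # number of labels available for that item.
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two labels are not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (sharing a common factor 1 / n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Hypothetical example: two annotators tagging four tokens.
example = [["NN", "NN"], ["NN", "VB"], ["VB", "VB"], ["VB", "VB"]]
alpha = krippendorff_alpha_nominal(example)
```

A value of 1 indicates complete agreement and 0 indicates agreement no better than chance, so the reported 0.964 sits close to the top of the scale.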
Similar resources
Linguistic Annotation for the Semantic Web
Establishing the semantic web on a large scale implies the widespread annotation of web documents with ontology-based knowledge markup. For this purpose, tools have been developed that allow for semi-automatic annotation of web documents with ontology-based metadata. However, given that a large number of web documents consist either fully or at least partially of free text, language technology ...
Annotating Geographical Entities
This paper describes a study based on exploration of relations between geographical entities. We suggested a new tool for training and evaluation required by related annotation experiments. It relates to an annotator used for semi-automatic annotation, starting with the geography manual. We define fifteen types of entities: location, geo_position, geology, landform, clime, water, dimension, per...
Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser
This work presents the development of the URDU.KON-TB treebank, its annotation guidelines and their evaluation, and the construction of an Urdu parser for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank and a parser will have a significant impact on the state of the art in automatic Urdu language processing. The work includes the ...
Semantic-Based Image Retrieval in the VQ Compressed Domain using Image Annotation Statistical Models
Annotation of Clinical Narratives in Bulgarian language
In this paper we describe the process of annotating clinical texts with morphosyntactic and semantic information. The corpus contains 1,300 discharge letters in Bulgarian for patients with endocrine and metabolic disorders. The annotated corpus will be used as a gold standard for evaluating information extraction on a test corpus of 6,200 discharge letters. The annotation is performed wi...